ABSTRACT



A method for software pipelining nested loops combines the 
inner and outer loops of the nested loop to form a merged 
loop. One or more operations from the outer loop are 
activated on selected passes through the merged loop, and 
the merged loop is software pipelined. 

25 Claims, 5 Drawing Sheets 
METHOD FOR SOFTWARE PIPELINING
NESTED LOOPS 

BACKGROUND OF THE INVENTION 

1. Technical Field
The present invention relates to methods for optimizing
computer code, and in particular, to methods for software
pipelining nested loops.

2. Background Art
Loops are software structures that allow programmers to
perform repeated operations using a single set of
instructions. A typical source code loop begins with a loop
instruction, e.g. a "Do", "While" or equivalent statement,
followed by the set of instructions ("loop body") to be
repeated. Arguments associated with the loop instruction
control the repetition of the loop body. These arguments
include a test for terminating the loop ("loop test"). The loop
test is typically a logical function of a variable that is
modified by the loop. It controls a branch instruction that
either exits (terminates) the loop or returns to the first
instruction of the loop body, depending on whether the test
is true or false, respectively. In counted loops, the loop
variable is an index that is incremented each time the
instructions of the loop body are executed, and the loop test
compares the index with a maximum value.

Loops are nested when the body of one loop (the "outer
loop") includes another loop (the "inner loop"). Perfectly
nested loops are those in which the outer loop includes no
instructions but those of the inner loop. Imperfectly nested
loops are those in which the outer loop includes instructions
in addition to those of the inner loop. In either case, each
time the outer loop is executed, the instructions that form its
loop body, including the inner loop, are executed. That is,
the inner loop is fully executed on each repetition of the
outer loop. The number of times the inner loop is executed
for each iteration of the outer loop is a function of the inner
loop test and the loop variable tested.

Depending on how they are implemented, loops can have
a significant impact on the performance of a program. For
example, the loop test is a branch condition which, if
mispredicted, requires the processor to flush the current
instructions from its pipeline, retrieve instructions from the
correct branch path, and load these instructions into the
pipeline. Misprediction is likely in loops since the branch is
taken on all but the final iteration of the loop, and
history-based branch prediction algorithms will predict the
branch taken on the final iteration. The resulting branch
misprediction is repeated every time the loop is entered. For
nested loops, the inner loop is entered on each iteration of
the outer loop, and the performance hit attributable to
mispredictions can be significant.
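
The misprediction count described above can be illustrated with a small simulation. The sketch below assumes a 2-bit saturating-counter predictor, which is only one example of a history-based scheme; the function names and the trip counts JMAX and LMAX are illustrative values, and only the loop-back branch of the inner loop is modeled.

```python
def count_mispredictions(outcomes, state=3):
    """Count mispredictions of a 2-bit saturating counter predictor.

    outcomes: sequence of booleans, True when the loop branch is taken.
    state: counter value 0..3; values 2 and 3 predict "taken".
    """
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Move the counter toward the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


def loop_branch(iterations):
    """Loop-back branch: taken on every iteration except the last."""
    return [True] * (iterations - 1) + [False]


JMAX, LMAX = 100, 10  # hypothetical outer and inner trip counts

# The inner loop is entered JMAX times, so its exit pattern repeats JMAX times.
nested_inner = loop_branch(LMAX) * JMAX
# A single loop covering all JMAX * LMAX iterations exits only once.
single_loop = loop_branch(JMAX * LMAX)

print(count_mispredictions(nested_inner))
print(count_mispredictions(single_loop))
```

Under these assumptions the predictor misses once per exit of the loop, so the count scales with the number of times the inner loop is entered.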

Program performance can also be degraded by the overhead
necessary to set up and terminate each loop. For nested
loops, this overhead is multiplied, since the cost is incurred
each time the instructions of the outer loop are repeated. If
the outer loop repeats 100 times, the overhead for the inner
loop is incurred 100 times. The smaller the loop body is,
relative to this overhead, the greater the efficiency cost of the
loop.

A number of methods have been developed to improve the
efficiency with which loops (nested or otherwise) are
implemented. For example, software pipelining takes advantage
of the fact that the loop body instructions are repeated on
each iteration of the loop by implementing the instructions
for different iterations of the loop in parallel. In a loop body
of three instructions, the first instruction may operate on
variables for the ith pass through the loop ("iteration"), while
the second and third instructions are implemented with
variables from the (i-1)th and (i-2)th iterations.

Under certain circumstances, the overhead cost of nested 
loops may be mitigated somewhat by "unrolling and jam- 
ming" the outer loop. Here, the instructions of the outer loop 
body for sequential iterations are combined for processing in 
a single iteration of a modified loop index. Each iteration of 
the outer loop then executes instructions for multiple, 
sequential values of the modified loop index, including the 
inner loop instructions. In addition, the outer loop instruc- 
tions may be rearranged within the expanded loop body, 
instruction dependencies permitting, to further streamline 
execution of the loop. 
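
As an illustration of unrolling and jamming, the sketch below unrolls a hypothetical outer loop by a factor of two and fuses the two resulting copies of the inner loop. The row-sum computation, the function names, and the unroll factor are assumptions chosen for the example; they are not taken from this disclosure.

```python
def nested(a, n_outer, n_inner):
    """Original nest: out[j] accumulates a[j][l] over the inner loop."""
    out = [0] * n_outer
    for j in range(n_outer):
        for l in range(n_inner):
            out[j] += a[j][l]
    return out


def unroll_and_jam(a, n_outer, n_inner):
    """Outer loop unrolled by 2, with both inner-loop copies jammed into one."""
    out = [0] * n_outer
    j = 0
    while j + 1 < n_outer:
        for l in range(n_inner):  # one inner loop serves two outer iterations
            out[j] += a[j][l]
            out[j + 1] += a[j + 1][l]
        j += 2
    if j < n_outer:  # remainder iteration when n_outer is odd
        for l in range(n_inner):
            out[j] += a[j][l]
    return out


data = [[10 * j + l for l in range(4)] for j in range(5)]
print(nested(data, 5, 4) == unroll_and_jam(data, 5, 4))
```

In this sketch the inner loop is entered roughly half as often, and its body is twice as large, which is the effect the paragraph above describes.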

These methods, where applicable, increase the size of the 
loop body. The size of the loop body determines the number 
of instructions (scope) that a compiler can consider 
simultaneously, for implementing an optimization process. 
To the extent that these techniques increase the number of 
instructions in the loop body, they may enable additional 
compiler optimizations. 

Despite their potential advantages, the above described 
techniques for handling loops are typically limited. For 
example, loop overhead is only reduced to the extent an 
outer loop can be unrolled, and this may be limited by 
dependencies between the inner and outer loop instructions. 
In addition, it is often practical to implement loop unrolling
and similar techniques for only the two innermost loops of
a set of nested loops. Some of these limitations are not
present in perfectly nested loops, but imperfectly nested 
loops are very common and subject to most of these limi- 
tations. 

SUMMARY OF THE INVENTION 

The present invention is a method for software pipelining 
nested loops. In accordance with the present invention, the 
inner and outer loops of a nested loop are combined to form 
a merged loop. One or more operations from the merged 
loop are conditioned to be activated on selected passes 
through the merged loop. 

In one embodiment of the invention, instructions from the 
inner and outer loops are merged and outer loop instructions 
are selectively activated using predication. A predicate con- 
dition is defined for each predicate so that the predicate 
condition is true when the associated instruction is to be 
activated. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The present invention may be understood with reference 
to the following drawings in which like elements are indi- 
cated by like numbers. These drawings are provided to 
illustrate selected embodiments of the present invention and 
are not intended to limit the scope of the invention. 

FIG. 1 represents a loop following software pipelining. 

FIGS. 2A and 2B represent nested loops following con- 
ventional software pipelining methods. 

FIG. 3 represents a nested loop that has been software 
pipelined using a method in accordance with the present 
invention. 

FIG. 4 is a flowchart representing an overview of the 
method for software pipelining nested loops in accordance 
with the present invention. 

FIG. 5 is a more detailed flowchart showing one embodi- 
ment of the method of FIG. 4. 
DETAILED DESCRIPTION OF THE
INVENTION 

The following discussion sets forth numerous specific 
details to provide a thorough understanding of the invention. 
However, those of ordinary skill in the art, having the benefit 
of this disclosure, will appreciate that the invention may be 
practiced without these specific details. In addition, various 
well known methods, procedures, components, and circuits 
have not been described in detail in order to focus attention 
on the features of the present invention. 

The present invention provides a method for combining
operations from two or more nested loops into a merged loop
and software pipelining the merged loop. The software
pipelined merged loop offers multiple advantages over the
nested loop structure from which it is formed. For example,
the loop overhead penalty associated with initiating and
terminating the inner loop on each iteration of the outer
loop(s) is significantly reduced as the number of separate
loops in the nested structure is reduced. For a pair of nested
loops pipelined in accordance with the present invention, the
inner loop overhead cost is incurred only once. The branch
mispredictions associated with the individual loops in the
nested structure are likewise reduced as the number of loops
is reduced. The merged loop has a larger loop body, which
increases the instruction scope to which various compiler
optimizations may be applied. This allows better use of
processor resources and increases the opportunities for
prefetching data.

In one embodiment of the present invention, these and
other advantages are provided by combining instructions
from two or more loops into a merged loop and associating
a predicate with selected instructions in the merged loop.
The predicate condition for a selected instruction is chosen
to activate the instruction (or results generated by the
instruction) during appropriate iterations of the merged loop.
For example, the predicate conditions for an instruction
from a given loop may be [illegible]. Other embodiments of
the invention may employ other methods to selectively
activate outer loop instructions or their effects.

The method of the present invention may be better 
understood with reference to standard software pipelining 
techniques. A pseudo code representation of a counted Do 
loop is: 



DO (initialize(L), test(L), update(L))
    a
    b
ENDDO                                        Loop (I)
translate to machine language instructions A, B, and C. In a
software pipeline 100, the different instructions correspond
to the stages of a pipeline. Instructions in a given row of
pipelined loop 100 are processed concurrently, and each
instruction is evaluated for increasing values of the loop
variable L in sequential rows. For purposes of illustration,
the loop variable is indicated in parentheses following each
instruction. For example, A(1), B(3), and C(N-2) represent
instructions A, B, and C evaluated using operands
appropriate for the 1st, 3rd, and (N-2)th iterations through
the loop.

During a prolog 160, the software pipeline 100 is filled.
Thus, at cycle 140(1), instruction A is executed using the
operands appropriate for L=1, e.g. A(1). At cycle 140(2),
instructions A and B are executed using operands
appropriate for L=2 and L=1, respectively, e.g. A(2), B(1). At
140(3), A(3), B(2), and C(1) are executed. During prolog 160,
resources associated with instructions B and/or C are not
utilized. For example, if A, B, and C are floating point
instructions and loop 100 is executed in a processor having
four floating point units (FPUs), three FPUs are idle at cycle
140(1), two are idle at cycle 140(2) and one is idle at cycle
140(3). Idle processor resources (waste 162) represent one
component of loop overhead.

At cycle 140(3), the software pipeline is finally filled, and 
instructions A, B, and C are evaluated concurrently for 
different values of L through cycle 140(N). For cycles 
140(3)-140(N) the slots of software pipeline 100 are full. At 
cycle 140(N), instruction A has been evaluated for all N 
iterations of loop (I). 

At cycles 140(N+1) and 140(N+2), software pipeline 100 
empties as instructions B and C complete their N iterations 
of loop 100. These cycles form an epilog 170 of software 
pipeline 100 for which resources associated first with A and 
then with B are idled. Idle processor resources (waste 172) 
represent another component of loop overhead. 
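
The prolog, full rows, and epilog described above can be reproduced with a short sketch. It assumes, as in the description of FIG. 1, that each instruction occupies one pipeline stage and that stage s trails stage 0 by s rows; the function name and the choice N=5 are illustrative.

```python
def pipeline_rows(stages, n):
    """Return the rows of a software pipeline for n loop iterations.

    stages: instruction names in pipeline order, e.g. ["A", "B", "C"].
    Row r (1-based) runs stage s (0-based) on iteration r - s, when that
    iteration exists; otherwise the slot is idle (None).
    """
    rows = []
    for r in range(1, n + len(stages)):  # n + len(stages) - 1 rows in total
        row = []
        for s, name in enumerate(stages):
            iteration = r - s
            row.append(f"{name}({iteration})" if 1 <= iteration <= n else None)
        rows.append(row)
    return rows


# Print the schedule for N = 5; "-" marks an idle slot in the prolog or epilog.
for cycle, row in enumerate(pipeline_rows(["A", "B", "C"], 5), start=1):
    print(cycle, " ".join(slot or "-" for slot in row))
```

With three stages, rows 1 and 2 form the prolog, rows 3 through N are full, and rows N+1 and N+2 form the epilog.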

The significance of loop overhead for a given loop
depends on the number of times the loop is iterated each
time it is entered, the number of instructions in the loop, and
the number of times the loop is entered. The first two factors
determine the number of rows for which the software
pipeline 100 is full relative to the number of rows in the
epilog and prolog, e.g. the overhead. The third factor
determines the number of times the overhead is incurred. In
general, a loop that is nested inside another loop is fully
iterated and its loop overhead is incurred each time the outer
loop is entered.
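
The three factors can be put into rough numbers. The sketch below counts idle pipeline slots under the simplifying assumption of one instruction per stage and one stage per row, which is only a model of the overhead discussed above; the function names are illustrative.

```python
def idle_slots_per_entry(stages):
    """Idle slots in prolog plus epilog for one entry of the loop.

    The prolog idles (stages-1) + (stages-2) + ... + 1 slots and the
    epilog idles the same number, giving stages * (stages - 1) in total.
    """
    return stages * (stages - 1)


def idle_fraction(stages, iterations):
    """Fraction of all slots that are idle for one entry of the loop."""
    rows = iterations + stages - 1
    return idle_slots_per_entry(stages) / (stages * rows)


def total_idle_slots(stages, entries):
    """Idle slots accumulated when the loop is entered `entries` times."""
    return idle_slots_per_entry(stages) * entries


print(idle_slots_per_entry(3))       # three-instruction loop, entered once
print(idle_fraction(3, 10))          # short loop: overhead is noticeable
print(idle_fraction(3, 1000))        # long loop: overhead is negligible
print(total_idle_slots(3, 100))      # inner loop entered 100 times
```

The idle fraction shrinks as the iteration count grows, while the total grows linearly with the number of entries, which is why a short inner loop entered many times is the costly case.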

A pseudo code representation of an outer loop (II) includ- 
ing an inner loop (I) is: 






In this example, "DO ( )" is the loop instruction,
instructions "a" and "b" form the loop body, and "ENDDO"
terminates the loop. The loop variable, L, tracks the number
of iterations of loop (I), initialize(L) represents its initial
value, and update(L) indicates how L is modified on each
iteration of the loop. Test(L) is a logical function of L, e.g.
L==LMAX, that terminates loop (I) when it is true, passing
control to instruction "e". Other types of loops, e.g.
"WHILE" and "FOR" loops, follow a similar pattern,
although they may not explicitly specify an initial value, and
the loop variable may be updated by instructions in the loop
body.

FIG. 1 represents loop (I) following software pipelining. 
Here, it is assumed that source code instructions a, b 






DO (initialize(J), test(J), update(J))
    g
    DO (initialize(L), test(L), update(L))
        a
        b
    ENDDO                                    Loop (I)
    h
ENDDO                                        Loop (II)



In the disclosed example, outer loop (II) includes
instructions g, h and loop (I) within its loop body. Test(L) and
test(J) represent loop termination conditions L==LMAX and
J==JMAX. Thus, each repetition of loop (II) executes
instruction g, followed by the iterations of loop (I)
(instructions a and b), followed by instruction h. Loop index
J is then incremented and the process repeated up to
J=JMAX. When nested loop (II) is compiled, loop (I) is
generally software pipelined in the manner described in
conjunction with FIG. 1.

FIG. 2A represents the JMAX iterations of outer loop (II).
For purposes of illustration, it is assumed that source code
instructions g, h are translated to assembly instructions G, H.
For J=1, instruction G of outer loop (II) is executed,
followed by LMAX iterations of loop (I) (parallelogram 200),
followed by instruction H of outer loop (II). This process is
repeated for J=2 through JMAX. As indicated, each time
loop (I) is entered, loop overhead is incurred in the form of
unused instruction slots associated with prolog 160 and
epilog 170 (FIG. 1).

In the example of FIG. 2A, instruction A is assumed to
depend on instruction G, and instruction H is assumed to
depend on instruction C. Thus, instruction G, loop (I), and
instruction H are executed sequentially. FIG. 2A also
represents the case where instructions A, B, and C fully utilize
processor resources, e.g. FPUs, that are also required by
instructions G and H.

FIG. 2B represents nested loops (I), (II) where
instructions G and H can be processed concurrently with inner
loop instructions A and C, respectively, e.g. A does not depend
on instruction G, instruction H does not depend on instruction
C, and sufficient processor resources are available to process
all instructions. This provides some speed up in the
processing of nested loops (I), (II). However, it does not
address the performance loss associated with repeated prologs
and epilogs of loop (I). Nor does it address the branch
mispredictions associated with terminating loop (I) for each
iteration of outer loop (II).

The present invention allows two or more loops to be
merged and software pipelined as a single loop, increasing
the scope of instructions available for compiler
optimizations, reducing the overhead associated with filling
and emptying the software pipeline, and reducing branch
mispredictions attributable to repeated entry and exit of the
inner loop.

A pseudo-code representation of nested loops (I), (II) 
modified in accordance with one embodiment of the present 
invention is: 



JITER = [(JMAX - JSTART)/JINC] + 1
LITER = [(LMAX - LSTART)/LINC] + 1
J = JSTART - JINC
L = LSTART
DO I = 1, JITER*LITER, 1
    IF (L .EQ. LSTART) THEN
        J = J + JINC
        OUTERLOOP_TOP
    ENDIF
    INNERLOOP_BODY
    L = L + LINC
    IF (L .GT. LEND) THEN
        OUTERLOOP_BOTTOM
        L = LSTART
    ENDIF
ENDDO



Outer loop instructions g and h, represented by
OUTERLOOP_TOP and OUTERLOOP_BOTTOM,
respectively, and inner loop instructions a and b, represented
by INNERLOOP_BODY, are combined in a single, merged
loop. A composite loop variable, I, for the merged loop,
varies from 1 to JITER*LITER, and conditionals are
inserted in the merged loop. In the above example, the
conditional, IF(L .EQ. LSTART), picks up those iterations of
the merged loop for which the inner loop of the original
nested structure is reentered, e.g. L .EQ. LSTART. When this
conditional is true, J is incremented and OUTERLOOP_TOP
instruction(s) is activated. Otherwise, these steps are
skipped. Similarly, the conditional, IF(L .GT. LEND), picks
up those iterations of the merged loop for which the inner
loop of the original nested structure is exited. When this
conditional is true, OUTERLOOP_BOTTOM instruction(s)
is activated and L is reinitialized. Otherwise, these steps
are skipped.
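
The equivalence between the nested form and the merged form can be checked with a small sketch. It records the order in which g, the inner loop body, and h would run, and compares the two orders. The sketch assumes positive increments, treats the square brackets in the pseudo-code as integer (floor) division, and takes LEND to be the same bound as LMAX; the function names and trace format are illustrative.

```python
def nested_trace(jstart, jmax, jinc, lstart, lmax, linc):
    """Execution order of the original nested loops (II) and (I)."""
    trace = []
    j = jstart
    while j <= jmax:
        trace.append(("g", j))
        l = lstart
        while l <= lmax:
            trace.append(("body", j, l))
            l += linc
        trace.append(("h", j))
        j += jinc
    return trace


def merged_trace(jstart, jmax, jinc, lstart, lmax, linc):
    """Execution order of the single merged loop with two conditionals."""
    jiter = (jmax - jstart) // jinc + 1
    liter = (lmax - lstart) // linc + 1
    j = jstart - jinc
    l = lstart
    trace = []
    for _ in range(jiter * liter):      # composite loop variable I
        if l == lstart:                 # inner loop is being (re)entered
            j += jinc
            trace.append(("g", j))      # OUTERLOOP_TOP
        trace.append(("body", j, l))    # INNERLOOP_BODY
        l += linc
        if l > lmax:                    # inner loop is being exited
            trace.append(("h", j))      # OUTERLOOP_BOTTOM
            l = lstart
    return trace


print(nested_trace(1, 3, 1, 1, 4, 1) == merged_trace(1, 3, 1, 1, 4, 1))
```

Both functions visit the same (J, L) pairs in the same order, but the merged version contains a single loop, so a compiler has only one loop to pipeline.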

In the disclosed embodiment, the outer loop instructions
are executed only for those iterations of composite variable
I for which the original outer loop variable changes, i.e. prior
to entering the inner loop and subsequent to completing the
inner loop. The resulting merged loop may be software
pipelined into a compact structure that significantly reduces
loop overhead for the inner loop and provides a larger loop
body on which additional optimizations may be
implemented.

The present invention may be implemented using
variations on the approach described above. In certain cases,
references to the inner and outer loop variables to activate
the conditionals may be eliminated. For example, where L
varies from 1 to 10 and J varies from 1 to 10, the merged loop
variable I goes from 1 to 100. Outer loop instructions can be
activated on iterations for which I Mod 10 equals 0. In
addition, a single conditional may be used to test for the end
of the inner loop and activate the instructions represented by
OUTERLOOP_BOTTOM and OUTERLOOP_TOP. Other
variations will be apparent to persons skilled in the art and
having the benefit of this disclosure.
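
One possible reading of this variation is sketched below. A single test on the composite index, I Mod LITER equal to 0, marks the end of each pass through the inner loop and activates OUTERLOOP_BOTTOM followed by OUTERLOOP_TOP for the next outer iteration. Executing the first OUTERLOOP_TOP ahead of the loop and skipping it after the final pass are assumptions of this sketch, as are the function name and trace format.

```python
def merged_trace_mod(jiter=10, liter=10):
    """Merged loop driven only by the composite index I."""
    trace = []
    j = 1
    trace.append(("g", j))               # first OUTERLOOP_TOP, ahead of the loop
    total = jiter * liter
    for i in range(1, total + 1):
        l = (i - 1) % liter + 1          # inner index recovered from I
        trace.append(("body", j, l))     # INNERLOOP_BODY
        if i % liter == 0:               # single test: end of an inner-loop pass
            trace.append(("h", j))       # OUTERLOOP_BOTTOM
            if i < total:
                j += 1
                trace.append(("g", j))   # OUTERLOOP_TOP for the next pass
    return trace


trace = merged_trace_mod()
print(sum(1 for entry in trace if entry[0] == "g"))   # number of OUTERLOOP_TOP activations
```

With the default arguments the conditional fires on I = 10, 20, ..., 100, matching the I Mod 10 example in the text.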

FIG. 3 is a schematic representation of nested loops (I),
(II) that have been modified and pipelined as a single loop
in accordance with the present invention. In order to
illustrate the flow of instructions through pipeline 300, each
instruction is identified by a pair of indices (J, L). These
indices indicate that the instructions are evaluated using
operands suitable to the Jth iteration of the outer loop and the
Lth iteration of the inner loop. For example, A(1,3) refers to
instruction A when it is executed using operands appropriate
for the first iteration of the outer loop (J=1) and the third
iteration of the inner loop (L=3). It is emphasized that
software pipeline 300 is based on the single merged loop for
which a single loop index I is operative. I varies between 1
and JITER*LITER to accommodate all combinations of
inner and outer loop iterations in a single loop that is formed
by merging outer loop (II) and inner loop (I) to a single loop
with instructions G, A, B, C, H. In the disclosed example,
JMAX=M and LMAX=K. The outer and inner loop indices
are provided to facilitate tracking the instructions.

For J=1, instructions G, A, B, C, H that form the merged
loop are loaded into the slots of a software pipeline 300
during a prolog 310. These instructions are subsequently
drained from merged loop 300 in an epilog 320, when
J=JMAX=M. Wasted instruction slots 312 and 322 are
associated with prolog 310 and epilog 320, but not with the
intervening increments of outer loop index J. During
loading, G is activated for cycle 350(1) and deactivated for
the next K cycles 350(2)-350(K+1), e.g. while the
instructions of inner loop complete their first K iterations. The
inactive state of G is indicated by no-operations (NOP(G))
in FIG. 3. A place holder for H (NOP(H)) is loaded into
software pipeline 300 during prolog 310, but H is not
activated until cycle K+4, following completion of the K
iterations of inner loop (I).

Dashed lines 330(1), 330(2) . . . 330(J-1) indicate where
in software pipeline 300 instruction transitions between
different values of outer loop index J occur. For example, G
is activated at cycle K+1, when the first instruction of the
inner loop body has completed its first K iterations, A(1,K).
In effect, G is turned on, temporarily, before the instructions
of inner loop (I) begin a second set of K iterations at cycle
K+2. At cycle K+4, when the last instruction of the inner
loop body has finished its first K iterations, H is activated.
Thus, H is turned on following completion of a full cycle of
inner loop instructions.
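
The activation pattern described for FIG. 3 can be modeled with a short sketch. It assumes a five-stage pipeline G, A, B, C, H with a one-cycle offset between successive stages, activates G when the inner index L equals 1 and H when L equals K, and labels G and H with the outer index only; these choices, the function name, and the values M=2, K=3 are assumptions of the sketch rather than a transcription of the figure.

```python
def merged_pipeline_rows(m, k):
    """Rows of a five-stage pipeline G, A, B, C, H over m * k merged iterations.

    Merged iteration i (1-based) corresponds to outer index j and inner
    index l. G is active only when l == 1 and H only when l == k; in all
    other iterations their slots hold NOP(G) or NOP(H).
    """
    stages = ["G", "A", "B", "C", "H"]
    total = m * k
    rows = []
    for cycle in range(1, total + len(stages)):
        row = []
        for s, name in enumerate(stages):
            i = cycle - s
            if not 1 <= i <= total:
                row.append(None)             # empty slot in prolog or epilog
                continue
            j, l = (i - 1) // k + 1, (i - 1) % k + 1
            if name == "G":
                row.append(f"G({j})" if l == 1 else "NOP(G)")
            elif name == "H":
                row.append(f"H({j})" if l == k else "NOP(H)")
            else:
                row.append(f"{name}({j},{l})")
        rows.append(row)
    return rows


for cycle, row in enumerate(merged_pipeline_rows(2, 3), start=1):
    print(cycle, " ".join(slot or "-" for slot in row))
```

Under these assumptions G for J=2 appears in cycle K+1 alongside A(1,K), A(2,1) follows in cycle K+2, and H for J=1 appears in cycle K+4, with no drain and refill of the inner loop stages in between.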

Cycles K+1 through K+4, spanned by line 330(1),
demonstrate one of the advantages of the present invention.
Instead of draining instructions from software pipeline 300
when outer loop variable J normally would increment, the
present invention selectively activates the outer loop
instruction(s) [illegible] inner loop instructions. The timing
with which outer loop instructions are activated takes into
account any dependencies between the outer and inner loop
instructions. In the illustrated example, it is assumed that A
depends on G and H depends on C. Accordingly, G(2,0) is
activated in the software pipeline during cycle K+1, when the
Kth iteration of A for the J=1 iteration of outer loop (II) is
completing. This allows G to complete before A(2,1), e.g.
the first iteration of A for J=2, is processed. Thereafter, the
first instances of instructions B and C for the J=2 iteration of
outer loop (II) occur in cycles K+3 and K+4, respectively.

Thus, software pipeline 300 is uninterrupted as sequential
passes through inner loop instructions are processed. In
particular, there is no need to drain and refill the software
pipeline of inner loop instructions before and after
executing [illegible] last iteration of instruction C for the
J=1 loop has completed.

Merged software pipeline 300 also eliminates most
branch mispredictions associated with the termination
condition, Test(L). These mispredictions are substantially
eliminated [several lines illegible], since outer loop
instructions G and H are implemented by previously unused
resources. Merged loop pipeline 300 thus provides the
compiler with greater scope (more instructions) for various
other compiler optimizations.

FIG. 3 represents nested counted loops that have been
modified in accordance with the present invention, but the
present invention is applicable to nested loops of any type.
For example, nested loops that include various types of
non-counted loops may be merged and pipelined using the
present invention. The loop variables tested by these loops
to determine when to terminate may be adjusted by one or
more operations within the loop, in contrast to the simple
increment/decrement scheme of counted loops. In this more
general case, [illegible] reflects a composite of the
counted/non-counted nature of the component loops, and the
loop test for the merged loop is the logical AND of the loop
tests of the component loops. As in the counted loop example,
the loop test(s) of the inner loop(s) is monitored to determine
when operations of the outer loop(s) should be activated. For
example, OUTERLOOP_TOP operations are activated when
the inner loop variable is initialized, and
OUTERLOOP_BOTTOM operations are activated when the
inner loop test evaluates true.

FIG. 4 is a flowchart showing an overview of a method
400 in accordance with the present invention for pipelining
nested loops. At step 410, the inner and outer loops are
combined to form a merged loop. Selected outer loop
operations are then conditioned 420 so they are activated
when appropriate during processing of the merged loop. In
one embodiment of the invention, outer loop instructions
may be selectively activated in the merged loop through
predication, using appropriate predicate conditions, e.g.
FIG. 3. In another embodiment, outer loop instructions may
be executed on each iteration of the merged loop. In this
embodiment, the results of the instructions may be
committed on only selected iterations using, for example,
conditional moves. The present invention is not limited to any
particular method for selectively activating outer loop
instructions or their effects on the program.

At step 430, the merged loop is software pipelined. This
is typically done at compile time as part of the optimization
procedure. The compiler [illegible] source code
[illegible], and the translated instructions are optimized.
Once the merged loop is defined and the outer loop
instructions are appropriately conditioned, standard software
pipelining methods may be used to complete the process.

FIG. 5 is a more detailed flowchart of one embodiment of
method 400. At step 510, operations from the inner and outer
loops are combined to form a merged loop. A loop variable
and loop test are determined 520 for the merged loop from
the loop variables and tests of the inner and outer loops.
Conditionals are defined 530 to pick out where in the merged
loop the original inner loop is [illegible] exit
conditional[s]. Operations originating in the outer loop that
precede the inner loop are predicated using the entry
conditional. Operations originating in the outer loop that
follow the inner loop are predicated 550 using the exit
conditional. The merged loop is then software pipelined 560.
As noted above, this may be done using standard techniques.
Moreover, additional compiler optimizations may be applied
to instructions of the merged loop to further enhance
performance of the pipelined instructions.

The present invention has been described in detail for the
case in which an inner loop has been combined with
instructions from an outer loop. Persons skilled in the art,
having the benefit of this disclosure, will recognize that the
present invention may be used to combine an inner loop with
more than one outer loop. In addition, the use of conditionals
in general, and predicates in particular, may be applied to
instructions of the inner loop, to further facilitate software
pipelining of the merged loop. In the disclosed embodiment,
for example, the inner loop instructions may be predicated
to turn on selectively during prolog 310, as needed, to fill the
instruction slots in software pipeline 300. In addition, the
inner loop instructions may be predicated to selectively turn
off during epilog 320, as needed, to drain the instruction
slots in software pipeline 300.

In the exemplary embodiments, pipelined instructions
have been shown executing for sequential values of the loop
variable, e.g. A(N) B(N+1) C(N+2). This is not always
possible since instructions may have relatively long
latencies, in which case dependent instructions must be
loaded into the pipeline in a manner that accommodates the
latency. For example, if A takes three clock cycles to
complete and B depends on A, the instructions may be
scheduled onto the pipeline as follows: A(N) B(N+3)
C(N+4). The present invention may be applied to nested
loops, whether or not such dependency issues exist.

It is further noted that the arrangement of instructions
within a given cycle of software pipeline 300 follows a
standard form for indicating the filling and emptying of the
instruction slots. It is noted, however, that the instruction
dependence is reflected in the relative placement of rows of
instructions, rather than the placement of individual
instructions within a given row. Accordingly, one embodiment
of software pipeline 300 may be represented in an alternative
form that emphasizes the role of predication in turning on and
off both inner and outer loop instructions.

P(1)*G, P(2)*A, P(3)*B . . . P(J)*INST . . . P(M)*H.

In this representation, predicates (conditionals) for the
different instructions are represented by P(J), where the
index is included to distinguish predicates for different
instructions. The various predicates activate/deactivate their
associated instructions as necessary to fill the software
pipeline and execute outer loop instructions at appropriate
junctures in the merged loop. Predicate conditions
associated with each instruction are defined to
activate/deactivate the instruction as needed.

There has thus been provided a method for software
pipelining nested loops by combining instructions from the
inner and outer loops of the nested loop structure into a
merged loop. Conditionals are added to the outer loop
instructions in the merged loop to selectively activate these
instructions where appropriate. The merged loop, including
the conditionals, is then software pipelined using standard
compiler methods.

What is claimed is:

1. A method for processing nested inner and outer loops
comprising:
forming a merged loop from the inner and outer loops;
[illegible]
conditioning one or more operations from the merged
loop to be activated on selected iterations of the merged
loop.

2. The method of claim 1, wherein conditioning
comprises:
identifying one or more instructions from the merged
loop; and
predicating the one or more merged loop instructions.

3. The method of claim 2, wherein predicating comprises:
identifying a loop test and initial loop variable for the
inner loop;
defining a first predicate that is true when the loop test is
satisfied; and
defining a second predicate that is true when the loop
variable is in its initial state.

4. The method of claim 1, wherein conditioning
comprises:
identifying one or more results associated with the one or
more merged loop operations; and
conditioning the one or more results to be available to
instructions of the merged loop on selected iterations of
the merged loop.

5. The method of claim 1, wherein conditioning
comprises conditioning one or more instructions from the
merged loop to be active on selected iterations of the merged
loop.

6. The method of claim 1, wherein conditioning
comprises conditioning one or more results associated with one
or more instructions from the merged loop to be available on
selected iterations of the merged loop.

7. A method for software pipelining instructions from
inner and outer loops of a nested loop comprising:
combining operations of the inner and outer loops to form
[remainder of claim missing in source]

8. The method of claim 7, wherein combining instructions
comprises:
defining a merged loop body to include operations from
the inner and outer loops;
defining a loop variable for the merged loop body from
loop variables of the inner and outer loops; and
defining a merged loop test from loop tests associated
with the inner and outer loops.

9. The method of claim 7, wherein predicating one or
more operations comprises:
defining the predicate to be true according to a selected
property of the inner loop; and
gating one or more of the outer loop operations with the
predicate.

10. The method of claim 9, wherein defining the predicate
comprises defining a first predicate to be true when an inner
loop test is true.

11. The method of claim 9, wherein defining the predicate
condition comprises defining a second predicate to be true
when an inner loop variable is initialized.

12. A method for preparing nested inner and outer loops
for processing, the method comprising:
combining operations from the inner and outer loops;
defining a merged loop variable from loop variables
associated with the inner and outer loops;
defining a merged loop test from loop tests associated
with the inner and outer loops; and
gating one or more operations from the outer loop on a
condition derived from [illegible].

13. The method of claim 12, wherein gating comprises
gating one or more operations from the outer loop according
to a condition derived from the loop test of the inner loop.

14. The method of claim 12, wherein gating comprises
gating one or more operations from the outer loop according
to a condition derived from an initial state of an inner loop
variable.

15. A method for processing a nested loop of inner and
outer loop instructions as a merged loop, the method
comprising:
executing inner loop instructions for a given iteration of
the merged loop;
evaluating one or more conditions according to a loop test
and loop variable associated with the inner loop
instructions; and
gating one or more outer loop instructions according to
the one or more conditions.

16. The method of claim 15, wherein evaluating
comprises:
evaluating a first condition that is true when the loop test
of the inner loop is true; and
evaluating a second condition that is true when the inner
loop variable is in an initial state.

17. The method of claim 15, wherein gating comprises:
executing one or more outer loop instructions that precede
the inner loop instructions when the second condition is
true; and
executing one or more outer loop instructions that follow
the inner loop instructions when the first condition is
true.

18. A machine readable storage medium on which are
stored instructions that may be executed by a processor to

a merged loop; and implement a method for processing nested inner and outer 

predicating one or more operations of the combined loop 65 loops, the method comprising: 

to activate the predicated instructions on selected itera- forming a merged loop from the inner and outer loops; 

tions of the merged loop. and 
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conditioning one or more operations from the outer loop 
to be activated on selected iterations of the merged 
loop. 

19. The machine readable medium of claim 18, wherein 
conditioning comprises:

identifying one or more instructions from the outer loop; 
and 

predicating the one or more outer loop instructions. 

20. The machine readable medium of claim 19, wherein 
predicating comprises: 

identifying a loop test and initial loop variable for the 
inner loop; 

defining a first predicate that is true when the loop test is 
satisfied; and

defining a second predicate that is true when the loop 
variable is in its initial state. 
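
The two predicates recited above can be illustrated with a minimal sketch. It is not taken from the patent: the function names, the row-sum workload, and the assumption of a rectangular input whose rows have at least one element are all hypothetical, chosen only to show outer loop operations being gated inside a single merged loop. Here `p_init` plays the role of the second predicate (inner loop variable in its initial state) and `p_done` plays the role of the first predicate (inner loop test satisfied, i.e. the inner loop would exit).

```python
def row_sums_nested(a):
    """Reference version: an ordinary nested loop with outer-loop work
    before and after the inner loop."""
    out = []
    for row in a:
        acc = 0              # outer-loop operation preceding the inner loop
        for x in row:
            acc += x         # inner-loop body
        out.append(acc)      # outer-loop operation following the inner loop
    return out


def row_sums_merged(a):
    """Merged version: one flat loop; outer-loop operations are gated by
    two predicates derived from the inner loop (hypothetical sketch)."""
    n = len(a)
    m = len(a[0]) if n else 0    # assumption: rectangular input, m >= 1
    out = []
    i = j = 0
    acc = 0
    for _ in range(n * m):       # merged trip count
        p_init = (j == 0)        # second predicate: inner variable is initial
        if p_init:
            acc = 0              # gated outer-loop operation (pre-inner)
        acc += a[i][j]           # inner-loop body runs on every pass
        j += 1
        p_done = (j == m)        # first predicate: inner loop test satisfied
        if p_done:
            out.append(acc)      # gated outer-loop operation (post-inner)
            i += 1
            j = 0                # re-initialize the inner loop variable
    return out
```

On a predicated architecture the two `if` statements would typically become predicated instructions rather than branches, which is what leaves a single branch-free loop body for the software pipeliner to schedule.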

21. The machine readable medium of claim 18, wherein 
conditioning comprises: 

identifying one or more results associated with the one or
more outer loop instructions; and

conditioning the one or more results to be available to
instructions of the merged loop on selected iterations of
the merged loop.


22. A machine readable medium on which are stored 
instructions that may be executed by a processor to imple- 
ment a method comprising: 

executing an iteration of a merged loop, the merged loop 
including inner and outer loop operations; 

testing a merged loop variable that is derived from an 
inner loop variable and an outer loop variable; and 

repeating executing and testing responsive to the merged 
loop variable having a first value. 

23. The machine readable medium of claim 22, wherein 
testing the merged loop variable comprises comparing the 
merged loop variable to a value determined from inner and 
outer loop tests. 
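
One possible way to derive a merged loop variable and its test is sketched below. The claims do not give a formula, so the linearization `k = i * m + j`, the bound `n * m`, and the function name are assumptions made purely for illustration, for counted loops whose trip counts `n` and `m` are known on entry.

```python
def merged_iterations(n, m):
    """Visit the (i, j) index pairs of an n-by-m counted loop nest using a
    single merged loop variable k (hypothetical linearization)."""
    visited = []
    k = 0
    limit = n * m                # value determined from both loop tests
    while k < limit:             # merged loop test
        i, j = divmod(k, m)      # recover outer and inner loop variables
        visited.append((i, j))
        k += 1                   # one increment per merged iteration
    return visited
```

The sequence of index pairs matches the order in which the original nest would visit them, so the merged test terminates the loop after exactly the same number of inner-body executions.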

24. The machine readable medium of claim 22, wherein 
executing an iteration of the merged loop comprises execut- 
ing the outer loop operation if a first condition is met. 

25. The machine readable medium of claim 24, wherein 
executing an iteration of the merged loop comprises: 

evaluating a predicate to determine whether the first
condition is met; and

executing the outer loop operation if the first condition is
met.